Changing how pre-trained models behave -- e.g., improving their performance on a downstream task or mitigating biases learned during pre-training -- is a common practice when developing machine learning systems. In this work, we propose a new paradigm for steering the behavior of neural networks, centered around \textit{task vectors}. A task vector specifies a direction in the weight space of a pre-trained model, such that movement in that direction improves performance on the task. We build task vectors by subtracting the weights of a pre-trained model from the weights of the same model after fine-tuning on a task. We show that these task vectors can be modified and combined together through arithmetic operations such as negation and addition, and the behavior of the resulting model is steered accordingly. Negating a task vector decreases performance on the target task, with little change in model behavior on control tasks. Moreover, adding task vectors together can improve performance on multiple tasks at once. Finally, when tasks are linked by an analogy relationship of the form ``A is to B as C is to D'', combining task vectors from three of the tasks can improve performance on the fourth, even when no data from the fourth task is used for training. Overall, our experiments with several models, modalities and tasks show that task arithmetic is a simple, efficient and effective way of editing models.
translated by Google Translate
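The task-vector arithmetic described in the first abstract (subtract pre-trained weights from fine-tuned weights, then negate or add the results) can be illustrated with a minimal sketch. It assumes weights are stored as a plain dictionary of float lists rather than real model tensors; the names `Weights`, `task_vector`, `negate`, `add_vectors`, and `apply_vector`, and the `scale` parameter, are hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Hypothetical stand-in for a model's weights: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def task_vector(pretrained: Weights, finetuned: Weights) -> Weights:
    """Task vector = fine-tuned weights minus pre-trained weights."""
    return {
        name: [f - p for f, p in zip(finetuned[name], pretrained[name])]
        for name in pretrained
    }


def negate(vector: Weights) -> Weights:
    """Flip the direction of a task vector (used to reduce performance on a task)."""
    return {name: [-v for v in values] for name, values in vector.items()}


def add_vectors(*vectors: Weights) -> Weights:
    """Element-wise sum of several task vectors (used to combine tasks)."""
    first = vectors[0]
    return {
        name: [sum(column) for column in zip(*(v[name] for v in vectors))]
        for name in first
    }


def apply_vector(pretrained: Weights, vector: Weights, scale: float = 1.0) -> Weights:
    """Move the pre-trained weights along a task vector.

    `scale` is an assumed knob for the step size; the abstract does not mention it.
    """
    return {
        name: [p + scale * v for p, v in zip(pretrained[name], vector[name])]
        for name in pretrained
    }


# Toy example with a single two-element parameter.
pretrained = {"w": [1.0, 2.0]}
finetuned_a = {"w": [1.5, 2.0]}  # pretend this was fine-tuned on task A
finetuned_b = {"w": [1.0, 3.0]}  # pretend this was fine-tuned on task B

tau_a = task_vector(pretrained, finetuned_a)
tau_b = task_vector(pretrained, finetuned_b)

multi_task = apply_vector(pretrained, add_vectors(tau_a, tau_b))  # addition
forget_a = apply_vector(pretrained, negate(tau_a))                # negation
```

In the same notation, the analogy case from the abstract would combine three vectors, for example `add_vectors(tau_c, tau_b, negate(tau_a))`, to approximate a vector for the fourth task.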
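We present Branch-Train-Merge (BTM), a communication-efficient algorithm for embarrassingly parallel training of large language models (LLMs). We show that it is possible to independently train subparts of a new class of LLMs on different subsets of the data, eliminating the massive multi-node synchronization currently required to train LLMs. BTM learns a set of independent expert LMs (ELMs), each specialized to a different textual domain, such as scientific or legal text. These ELMs can be added and removed to update data coverage, ensembled to generalize to new domains, or averaged to collapse back to a single LM for efficient inference. New ELMs are learned by branching from (mixtures of) ELMs in the current set, further training the parameters on data for the new domain, and then merging the resulting model back into the set for future use. Experiments show that BTM improves in-domain and out-of-domain perplexities compared to GPT-style Transformer LMs when controlling for training cost. Through extensive analysis, we show that these results are robust to different ELM initialization schemes, but require expert domain specialization; LM ensembles with random data splits do not perform well. We also present a study of scaling BTM to a new corpus of 64 domains (192B whitespace-separated tokens in total); the resulting LM (22.4B total parameters) performs as well as a Transformer LM trained with 2.5 times more compute. These gains grow with the number of domains, suggesting that more aggressive parallelism could be used to efficiently train larger models in future work.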
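The two ways of combining expert LMs mentioned in the BTM abstract, averaging parameters into a single LM and ensembling expert predictions, can be sketched as follows. This is only an illustration under simplifying assumptions: parameters are plain dictionaries of float lists, each expert's prediction is a short probability list, and the mixing weights are supplied by hand. The names `Params`, `average_experts`, and `ensemble_probs` are hypothetical, and the sketch does not cover how BTM trains experts or chooses weights.

```python
from typing import Dict, List, Sequence

# Hypothetical stand-in for an expert LM's parameters.
Params = Dict[str, List[float]]


def average_experts(experts: Sequence[Params], weights: Sequence[float]) -> Params:
    """Collapse several experts into one parameter set via a weighted average."""
    total = sum(weights)
    norm = [w / total for w in weights]  # normalize so the weights sum to 1
    return {
        name: [
            sum(w * expert[name][i] for w, expert in zip(norm, experts))
            for i in range(len(experts[0][name]))
        ]
        for name in experts[0]
    }


def ensemble_probs(expert_probs: Sequence[List[float]],
                   weights: Sequence[float]) -> List[float]:
    """Mix the experts' next-token distributions with normalized weights."""
    total = sum(weights)
    return [
        sum((w / total) * probs[i] for w, probs in zip(weights, expert_probs))
        for i in range(len(expert_probs[0]))
    ]


# Toy example: two experts with one two-element parameter each.
science_expert = {"w": [1.0, 3.0]}
legal_expert = {"w": [3.0, 1.0]}

# Averaging: a single merged parameter set, which could also serve as the
# starting point when branching a new expert from a mixture of existing ones.
merged = average_experts([science_expert, legal_expert], weights=[1.0, 1.0])

# Ensembling: mix per-expert distributions over a two-token vocabulary.
mixed = ensemble_probs([[0.75, 0.25], [0.25, 0.75]], weights=[3.0, 1.0])
```

Averaging yields one model with the inference cost of a single expert, while ensembling keeps every expert active at inference time; the abstract presents both as options.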
Language models pretrained on text from a wide variety of sources form the foundation of today's NLP. In light of the success of these broad-coverage models, we investigate whether it is still helpful to tailor a pretrained model to the domain of a target task. We present a study across four domains (biomedical and computer science publications, news, and reviews) and eight classification tasks, showing that a second phase of in-domain pretraining (domain-adaptive pretraining) leads to performance gains, under both high- and low-resource settings. Moreover, adapting to the task's unlabeled data (task-adaptive pretraining) improves performance even after domain-adaptive pretraining. Finally, we show that adapting to a task corpus augmented using simple data selection strategies is an effective alternative, especially when resources for domain-adaptive pretraining might be unavailable. Overall, we consistently find that multi-phase adaptive pretraining offers large gains in task performance.
Large-scale datasets for natural language inference are created by presenting crowd workers with a sentence (premise), and asking them to generate three new sentences (hypotheses) that it entails, contradicts, or is logically neutral with respect to. We show that, in a significant portion of such data, this protocol leaves clues that make it possible to identify the label by looking only at the hypothesis, without observing the premise. Specifically, we show that a simple text categorization model can correctly classify the hypothesis alone in about 67% of SNLI (Bowman et al., 2015) and 53% of MultiNLI (Williams et al., 2018). Our analysis reveals that specific linguistic phenomena such as negation and vagueness are highly correlated with certain inference classes. Our findings suggest that the success of natural language inference models to date has been overestimated, and that the task remains a hard open problem.